Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution

Authors

  • Mikel L. FORCADA
  • Miquel ESPLÀ-GOMIS
  • Juan Antonio PÉREZ-ORTIZ
Abstract

Sentence-aligned web-crawled parallel text or bitext is frequently used to train statistical machine translation systems. To that end, web-crawled sentence-aligned bitext sets are sometimes made publicly available and distributed by translation technologies practitioners. Contrary to what may be commonly believed, distribution of web-crawled text is far from being free from legal implications, and may sometimes actually violate the usage restrictions. As the distribution and availability of sentence-aligned bitext is key to the development of statistical machine translation systems, this paper proposes an alternative: instead of copying and distributing copies of web content in the form of sentence-aligned bitext, one could distribute a legally safer stand-off annotation of web content, that is, files that identify where the aligned sentences are, so that end users can use this annotation to privately recrawl the bitexts. The paper describes and discusses the legal and technical aspects of this proposal, and outlines an implementation.
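As a rough illustration of the idea, a stand-off annotation could be a list of pointers (URL plus character offsets) into documents that the end user fetches privately. The sketch below is only an assumption about what such a file might contain; the abstract does not specify the actual format, and the names `Span`, `make_span`, `resolve` and `rebuild_bitext`, the offset scheme, the SHA-1 check, and the example.org URLs are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """Pointer to one sentence inside a web document (hypothetical format)."""
    url: str    # document the end user must fetch privately
    start: int  # character offset where the sentence starts (inclusive)
    end: int    # character offset where the sentence ends (exclusive)
    sha1: str   # digest of the expected sentence, to detect changed pages


def make_span(url: str, text: str, start: int, end: int) -> Span:
    """Build a pointer; only offsets and a digest are stored, not the text."""
    digest = hashlib.sha1(text[start:end].encode("utf-8")).hexdigest()
    return Span(url, start, end, digest)


def resolve(span: Span, pages: Dict[str, str]) -> Optional[str]:
    """Return the sentence a span points to, or None if missing or changed."""
    text = pages.get(span.url)
    if text is None:
        return None
    sentence = text[span.start:span.end]
    if hashlib.sha1(sentence.encode("utf-8")).hexdigest() != span.sha1:
        return None  # the page no longer matches the annotation
    return sentence


def rebuild_bitext(
    alignment: List[Tuple[Span, Span]], pages: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Recreate sentence pairs locally from privately fetched pages."""
    pairs = []
    for src, tgt in alignment:
        s, t = resolve(src, pages), resolve(tgt, pages)
        if s is not None and t is not None:
            pairs.append((s, t))
    return pairs


# Toy data standing in for pages the end user has recrawled (assumption).
en_page = "Welcome. Opening hours are 9 to 5."
es_page = "Bienvenido. El horario es de 9 a 5."
pages = {"https://example.org/en": en_page, "https://example.org/es": es_page}

# What the distributor would ship: pointers only, no copied sentences.
alignment = [
    (make_span("https://example.org/en", en_page, 0, 8),
     make_span("https://example.org/es", es_page, 0, 11)),
]

print(rebuild_bitext(alignment, pages))
```

In this sketch the distributed file carries no sentence text, and a pair is silently dropped when either page has changed since annotation. Whether character offsets, a digest, or some other anchoring method is appropriate is a design choice the abstract leaves open.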


Similar resources

AO: An Open Annotation Ontology for Science on the Web

We present the Annotation Ontology (AO), an open ontology in OWL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables “stand-off” (separate) metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under update control of the annotator. ...


Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...


Practical applications of stand-off annotation

An information system that makes use of stand-off annotation stores metadata separately from the data they describe. System architectures separate metadata from data in order to cope with heterogeneous annotations or with multimedia formats. This paper discusses some of the practical aspects of implementing an information system with a stand-off architecture. Two systems that use stand-off anno...


TEITOK: Text-Faithful Annotated Corpora

TEITOK is a web-based framework for corpus creation, annotation, and distribution that combines textual and linguistic annotation within a single TEI-based XML document. TEITOK provides several built-in NLP tools to automatically (pre)process texts, and is highly customizable. It features multiple orthographic transcription layers, and a wide range of user-defined token-based annotations. For ...


Web-crawling reliability

In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selectiv...




Publication date: 2016